
train: add simple loading already tokenized data from parquet dataset #14522


Open · lexasub wants to merge 1 commit into master from parquet2

Conversation

@lexasub (Contributor) commented Jul 3, 2025

We also need to add streaming/batching, but that is a more complex task. :)

@github-actions bot added the build (Compilation issues) and examples labels on Jul 3, 2025
@lexasub lexasub force-pushed the parquet2 branch 2 times, most recently from 1bb0911 to 2574024 Compare July 3, 2025 20:22
@lexasub lexasub marked this pull request as draft July 3, 2025 20:23
@lexasub (Contributor, Author) commented Jul 8, 2025

@JohannesGaessler what do you think about my changes? :)

@JohannesGaessler (Collaborator) commented:

Sorry for the late reply. Generally speaking I would greatly prefer it if the training data were to be stored as GGUF files. That will make my life as a maintainer much easier since I won't have to deal with external dependencies.

How about this: come up with a standardized way to define training data as GGUF, write code for constructing a ggml_opt_dataset from GGUF, and write code for converting text/Parquet to GGUF (this part can be Python). In the GGUF file, define one tensor for each sequence of characters or tokens. Streaming can be achieved by first loading only the metadata and then loading the tensor data as needed. I'm not yet 100% sure what the specification for the metadata should be.
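
For illustration, a minimal converter sketch in Python, assuming the gguf-py package that ships with llama.cpp and a Parquet file whose "tokens" column holds pre-tokenized sequences; the file name, column name, arch string, and metadata key are placeholders, not a finalized spec:

    # parquet_to_gguf.py - minimal sketch, not the final converter
    import numpy as np
    import pyarrow.parquet as pq
    import gguf  # gguf-py, bundled with llama.cpp

    table = pq.read_table("train.parquet")           # assumed input file
    sequences = table.column("tokens").to_pylist()   # assumed column: list of token-id lists

    writer = gguf.GGUFWriter("train.gguf", arch="dataset")      # arch string is a placeholder
    writer.add_uint32("dataset.num_sequences", len(sequences))  # placeholder key name

    # one I32 tensor per sequence, as suggested above
    for i, seq in enumerate(sequences):
        writer.add_tensor(f"seq_{i}", np.asarray(seq, dtype=np.int32))

    writer.write_header_to_file()
    writer.write_kv_data_to_file()
    writer.write_tensors_to_file()
    writer.close()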

@lexasub (Contributor, Author) commented Jul 8, 2025

Something like:

metadata: {
  "num_sequences": 100000,
  "vocab_size": 32000,
  "max_seq_len": 2048,
  "tokenizer": "llama"
}
tensors: [
  { "name": "seq_0", "shape": [2048], "data": [...] },
  { "name": "seq_1", "shape": [1536], "data": [...] },
  ...
]
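
Regarding the streaming idea mentioned above: gguf-py's GGUFReader memory-maps the file, so tensor names and shapes can be inspected without reading the token data. A minimal read-back sketch, assuming a file written with the layout above:

    # read_gguf_dataset.py - sketch of lazy access to the sequences
    from gguf import GGUFReader

    reader = GGUFReader("train.gguf")   # memory-maps the file

    # only tensor names and shapes are touched here; the data stays on disk
    for t in reader.tensors:
        print(t.name, list(t.shape))

    # materialize a single sequence on demand (numpy view backed by the memory map)
    seq_0 = reader.tensors[0].data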

@JohannesGaessler (Collaborator) commented:

Preferably use a prefix for the metadata and tensors; looking at llama-arch.cpp, how about e.g. dataset.num_sequences? Other than that, the layout LGTM.

@lexasub (Contributor, Author) commented Jul 8, 2025

We will use the training. prefix for all keys to avoid conflicts with model metadata.

Metadata

training.format.version: string (e.g. "1.0") - Specification version, in case of future changes.

training.dataset.name: string (optional) - Dataset name (e.g. "OpenWebText-ru").

training.dataset.source: string (optional) - URL or description of the data source.

training.file.creation_date: string (ISO 8601) - File creation date.

training.tokenizer.gguf.model: string - Tokenizer model name (llama, gpt2, etc.).

training.tokenizer.gguf.vocab: array[string] - Tokenizer dictionary.

training.tokenizer.gguf.merges: array[string] - Tokenizer merges (for BPE).

training.tokenizer.gguf.pre: string (optional) - Pre-tokenization architecture.

Note: Instead of storing the entire tokenizer, you could reference the model file, but embedding ensures that the data file is completely self-contained.

training.sequence.count: uint64 - Total number of sequences in the file.

training.sequence.lengths: array[uint32] - Key field! An array containing the length of each sequence in tokens. This will allow for efficient "bucketing" (grouping sequences of similar length) in the future.

Tensors

Naming: training.tensor.{index} (e.g. training.tensor.0, training.tensor.1, ...).

Data type: GGML_TYPE_I32 (standard for tokens in llama.cpp).

Shape: [sequence_length] - One-dimensional array. sequence_length will be different for each tensor.
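
As a concrete illustration of this layout, a short sketch that writes the proposed keys and tensors with gguf-py; all values are made up for the example, and the arch string is a placeholder:

    # write_training_dataset.py - sketch of the proposed key/tensor layout
    import numpy as np
    import gguf

    writer = gguf.GGUFWriter("dataset.gguf", arch="dataset")   # arch string is a placeholder

    writer.add_string("training.format.version", "1.0")
    writer.add_string("training.dataset.name", "OpenWebText-ru")
    writer.add_string("training.file.creation_date", "2025-07-08T00:00:00Z")
    writer.add_string("training.tokenizer.gguf.model", "llama")
    writer.add_uint64("training.sequence.count", 2)
    writer.add_array("training.sequence.lengths", [3, 2])

    # one I32 tensor per sequence, named training.tensor.{index}
    writer.add_tensor("training.tensor.0", np.array([1, 2, 3], dtype=np.int32))
    writer.add_tensor("training.tensor.1", np.array([4, 5], dtype=np.int32))

    writer.write_header_to_file()
    writer.write_kv_data_to_file()
    writer.write_tensors_to_file()
    writer.close()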

@JohannesGaessler (Collaborator) commented:

I don't think you need an array with the sequence lengths per tensor since you can just query the shape of a tensor. I think it's enough to store the maximum sequence length (could also get this from iterating over tensors).

Consider that people may also want to store untokenized datasets; I would suggest using uint8 plus metadata for the encoding in those cases (it's fine if this use case is not implemented in this PR).
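
For the untokenized case, one possible approach (sketch only; the encoding key name is hypothetical, and since ggml's tensor types include I8 but no unsigned 8-bit type, the raw UTF-8 bytes are stored as an I8 tensor here):

    # sketch: storing an untokenized sequence as raw bytes plus an encoding key
    import numpy as np
    import gguf

    writer = gguf.GGUFWriter("raw_text.gguf", arch="dataset")   # arch string is a placeholder
    writer.add_string("training.data.encoding", "utf-8")        # hypothetical key

    text = "raw, untokenized training text"
    raw = np.frombuffer(text.encode("utf-8"), dtype=np.int8).copy()
    writer.add_tensor("training.tensor.0", raw)

    writer.write_header_to_file()
    writer.write_kv_data_to_file()
    writer.write_tensors_to_file()
    writer.close()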

@lexasub (Contributor, Author) commented Jul 10, 2025

A quick-and-dirty implementation of a converter to the new format: https://github.com/lexasub/llama.cpp/tree/finetune-backup :)

@lexasub (Contributor, Author) commented Jul 10, 2025

@JohannesGaessler as a first step, I have added support for a GGUF dataset in #14622.

Labels: build (Compilation issues), examples